NSF PAR Search | NSF Public Access Repository

Understanding the Gain from Data Filtering in Multimodal Contrastive Learning

Pareek, Divyansh; Oh, Sewoong; Du, Simon Shaolei (December 2025, 39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025))

The success of modern multimodal representation learning relies on internet-scale datasets. Due to the low quality of a large fraction of raw web data, data curation has become a critical step in the training pipeline. Filtering using a trained model (i.e., teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting as the fraction of data with correctly matched modalities among paired samples, we utilize a linear contrastive learning setup to show a provable benefit of data filtering: the error without filtering is upper and lower bounded by , and the error with teacher-based filtering is upper bounded by in the large regime, and by in the small regime.
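As a rough illustration of teacher-based filtering as described above, the sketch below scores each paired sample with a stand-in "teacher" and keeps the highest-scoring fraction. Everything here is an assumption made for illustration: the function names (`cosine`, `teacher_filter`), the use of cosine similarity as the quality score, the identity encoders, and the `keep_fraction` parameter are hypothetical and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def teacher_filter(pairs, embed_a, embed_b, keep_fraction):
    """Rank pairs by teacher similarity and keep the top keep_fraction of them.

    embed_a / embed_b stand in for the teacher's two modality encoders
    (hypothetical; any callables returning vectors work).
    """
    scored = sorted(
        pairs,
        key=lambda p: cosine(embed_a(p[0]), embed_b(p[1])),
        reverse=True,
    )
    n_keep = max(1, round(keep_fraction * len(scored)))
    return scored[:n_keep]

# Toy data: both "modalities" are already 2-D vectors, so the teacher
# encoders are the identity. Two pairs are aligned, two are not.
identity = lambda v: v
pairs = [
    ((1.0, 0.0), (0.9, 0.1)),    # matched
    ((0.0, 1.0), (0.1, 0.8)),    # matched
    ((1.0, 0.0), (-1.0, 0.2)),   # mismatched
    ((0.0, 1.0), (0.3, -0.9)),   # mismatched
]
kept = teacher_filter(pairs, identity, identity, keep_fraction=0.5)
```

In this toy case the two mismatched pairs receive negative scores and are dropped. A contrastive model would then be trained on `kept` only.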
Full Text Available
SuperBPE: Space Travel for Language Models

Liu, Alisa; Hayase, Jonathan; Hofmann, Valentin; Oh, Sewoong; Smith, Noah A; Choi, Yejin (October 2025, Conference on Language Modeling)

Full Text Available
Sampling from Your Language Model One Byte at a Time

Hayase, Jonathan; Liu, Alisa; Smith, Noah A; Oh, Sewoong (July 2025, cs.CL)

Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations, an issue known as the Prompt Boundary Problem (PBP). For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. While this heuristic is effective in English, the underlying PBP continues to affect languages such as Chinese as well as code generation, where tokens often do not line up with word and syntactic boundaries. In this work, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time or transfer the post-training from one model to another using proxy-tuning. We demonstrate in experiments that the ensemble and proxy-tuned models outperform their constituents on downstream evals. Code is available at this https URL.
Full Text Available
PLeaS-Merging Models with Permutations and Least Squares

Nasery, Anshul; Hayase, Jonathan; Koh, Pang Wei; Oh, Sewoong (June 2025, Proceedings of the Computer Vision and Pattern Recognition Conference)

The democratization of machine learning systems has made the process of fine-tuning accessible to practitioners, leading to a wide range of open-source models fine-tuned on specialized tasks and datasets. Recent work has proposed to merge such models to combine their functionalities. However, prior approaches are usually restricted to models that are fine-tuned from the same base model. Furthermore, the final merged model is typically required to be of the same size as the original models. In this work, we propose a new two-step algorithm to merge models, termed PLeaS, which relaxes these constraints. First, leveraging the Permutation symmetries inherent in the two models, PLeaS partially matches nodes in each layer by maximizing alignment. Next, PLeaS computes the weights of the merged model as a layer-wise Least Squares solution to minimize the approximation error between the features of the merged model and the permuted features of the original models. PLeaS allows a practitioner to merge two models sharing the same architecture into a single performant model of a desired size, even when the two original models are fine-tuned from different base models. We also demonstrate how our method can be extended to address a challenging scenario where no data is available from the fine-tuning domains. We demonstrate our method to merge ResNet and ViT models trained with shared and different label spaces, and show improvement over the state-of-the-art merging methods of up to 15 percentage points for the same target compute while merging models trained on DomainNet and fine-grained classification tasks.
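The permutation-matching idea in the first step can be illustrated on a toy layer. The sketch below is not the PLeaS algorithm: it performs a full (not partial) matching by brute force over all permutations, omits the least-squares step entirely, and uses a plain inner product as the alignment score. The names `dot` and `best_permutation` and the toy feature vectors are hypothetical.

```python
from itertools import permutations

def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def best_permutation(feats_a, feats_b):
    """Return perm so that unit i of model A is matched to unit perm[i] of model B.

    Alignment is the sum of inner products between matched units' features.
    Brute force is only feasible for a handful of units.
    """
    n = len(feats_a)
    best, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(dot(feats_a[i], feats_b[perm[i]]) for i in range(n))
        if score > best_score:
            best, best_score = perm, score
    return best

# Each row holds one unit's activations on three inputs (toy values).
feats_a = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
# Model B holds the same units in a shuffled order.
feats_b = [feats_a[1], feats_a[2], feats_a[0]]
perm = best_permutation(feats_a, feats_b)
```

Here the search recovers the shuffle, matching each unit of A to its copy in B. For realistic layer widths, an assignment solver such as the Hungarian algorithm would typically replace the brute-force loop.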
Full Text Available
S4S: Solving for a Fast Diffusion Model Solver

Frankel, Eric; Chen, Sitan; Li, Jerry; Koh, Pang Wei; Ratliff, Lillian J; Oh, Sewoong (July 2025, 42nd International Conference on Machine Learning)

Diffusion models (DMs) create samples from a data distribution by starting from random noise and iteratively solving a reverse-time ordinary differential equation (ODE). Because each step in the iterative solution requires an expensive neural function evaluation (NFE), there has been significant interest in approximately solving these diffusion ODEs with only a few NFEs without modifying the underlying model. However, in the few NFE regime, we observe that tracking the true ODE evolution is fundamentally impossible using traditional ODE solvers. In this work, we propose a new method that learns a good solver for the DM, which we call Solving for the Solver (S4S). S4S directly optimizes a solver to obtain good generation quality by learning to match the output of a strong teacher solver. We evaluate S4S on six different pre-trained DMs, including pixel-space and latent-space DMs for both conditional and unconditional sampling. In all settings, S4S uniformly improves the sample quality relative to traditional ODE solvers. Moreover, our method is lightweight, data-free, and can be plugged in black-box on top of any discretization schedule or architecture to improve performance. Building on top of this, we also propose S4S-Alt, which optimizes both the solver and the discretization schedule. By exploiting the full design space of DM solvers, with 5 NFEs, we achieve an FID of 3.73 on CIFAR10 and 13.26 on MS-COCO, representing a 1.5x improvement over previous training-free ODE methods.
Full Text Available
Zeroth-Order Optimization Finds Flat Minima

Zhang, Liang; Li, Bingcong; Thekumparampil, Kiran Koshy; Oh, Sewoong; Muehlebach, Michael; He, Niao (June 2025, https://doi.org/10.48550/arXiv.2506.05454)

Zeroth-order methods are extensively used in machine learning applications where gradients are infeasible or expensive to compute, such as black-box attacks, reinforcement learning, and language model fine-tuning. Existing optimization theory focuses on convergence to an arbitrary stationary point, but less is known about the implicit regularization that provides a fine-grained characterization of which particular solutions are reached. This paper shows that zeroth-order optimization with the standard two-point estimator favors solutions with small trace of Hessian, a measure widely used to distinguish between sharp and flat minima. The authors provide convergence rates of zeroth-order optimization to approximate flat minima for convex and sufficiently smooth functions, defining flat minima as minimizers that achieve the smallest trace of Hessian among all optimal solutions. Experiments on binary classification tasks with convex losses and language model fine-tuning support the theoretical findings.
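For readers unfamiliar with the two-point estimator mentioned above, a minimal sketch follows. It assumes Gaussian perturbation directions and a central difference, which is one common form; the paper's exact estimator, step sizes, and smoothing parameter may differ. The names `two_point_grad`, `zo_descent`, and `quadratic` are hypothetical, and the toy objective only shows the update rule, not the flat-minima effect.

```python
import random

def two_point_grad(f, x, mu=1e-3, rng=random):
    """Estimate the gradient of f at x from two function evaluations.

    Draws a Gaussian direction u and returns
    (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u.
    """
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return [scale * ui for ui in u]

def zo_descent(f, x0, lr=0.05, steps=500, mu=1e-3, seed=0):
    """Plain descent using the two-point estimate in place of the gradient."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = two_point_grad(f, x, mu, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

def quadratic(x):
    """Toy convex objective with its minimum at the origin."""
    return sum(xi * xi for xi in x)

x_final = zo_descent(quadratic, [1.0, -2.0])
```

Only function values are queried, never gradients, which is what makes the approach usable for black-box objectives.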
Full Text Available
Foundation model for mass spectrometry proteomics

Sanders, Justin; Yilmaz, Melih; Russell, Jacob H; Bittremieux, Wout; Fondrie, William E; Riley, Nicholas M; Oh, Sewoong; Noble, William Stafford (May 2025, https://doi.org/10.48550/arXiv.2505.10848)

Mass spectrometry is the dominant technology in the field of proteomics, enabling high-throughput analysis of the protein content of complex biological samples. Due to the complexity of the instrumentation and resulting data, sophisticated computational methods are required for the processing and interpretation of acquired mass spectra. Machine learning has shown great promise to improve the analysis of mass spectrometry data, with numerous purpose-built methods for improving specific steps in the data acquisition and analysis pipeline reaching widespread adoption. Here, we propose unifying various spectrum prediction tasks under a single foundation model for mass spectra. To this end, we pre-train a spectrum encoder using de novo sequencing as a pre-training task. We then show that using these pre-trained spectrum representations improves our performance on the four downstream tasks of spectrum quality prediction, chimericity prediction, phosphorylation prediction, and glycosylation status prediction. Finally, we perform multi-task fine-tuning and find that this approach improves the performance on each task individually. Overall, our work demonstrates that a foundation model for tandem mass spectrometry proteomics trained on de novo sequencing learns generalizable representations of spectra, improves performance on downstream tasks where training data is limited, and can ultimately enhance data acquisition and analysis in proteomics experiments.
Full Text Available
Understanding the gains from repeated self-distillation

Pareek, Divyansh; Du, Simon S; Oh, Sewoong (December 2024, 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024))

Self-Distillation is a special type of knowledge distillation where the student model has the same architecture as the teacher model. Despite using the same architecture and the same training data, self-distillation has been empirically observed to improve performance, especially when applied repeatedly. For such a process, there is a fundamental question of interest: How much gain is possible by applying multiple steps of self-distillation? To investigate this relative gain, we propose using the simple but canonical task of linear regression. Our analysis shows that the excess risk achieved by multi-step self-distillation can significantly improve upon a single step of self-distillation, reducing the excess risk by a factor of , where is the input dimension. Empirical results on regression tasks from the UCI repository show a reduction in the learnt model's risk (MSE) by up to %.
Full Text Available
Spurious Rewards: Rethinking Training Signals in RLVR

Shao, Rulin; Li, Shuyue Stella; Xin, Rui; Geng, Scott; Wang, Yiping; Oh, Sewoong; Du, Simon Shaolei; Lambert, Nathan; Min, Sewon; Krishna, Ranjay; et al (June 2025, cs.AI)

Full Text Available
Label poisoning is all you need

Liu, Xiyang; Jain, Prateek; Kong, Weihao; Oh, Sewoong; Suggala, Arun (December 2024, In 37th Conference on Neural Information Processing Systems (NeurIPS). Advances in Neural Information Processing Systems)

Full Text Available
